Augmenting words with linguistic information for n-gram language models
Abstract
The main goal of the present work is to explore the use of rich lexical information in language modelling. We reformulated the task of a language model: instead of predicting the next word given its history, the model simultaneously predicts both the word and a tag encoding various types of lexical information. Using part-of-speech tags and syntactic/semantic feature tags obtained with a set of NLP tools developed at Microsoft Research, we obtained a reduction in perplexity compared to the baseline phrase trigram model in a set of preliminary tests performed on part of the WSJ corpus.
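The reformulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each token is a (word, tag) pair, estimates P(word, tag | previous two pairs) with simple add-k smoothing (a stand-in chosen for brevity, not the smoothing used in the paper), and recovers a word probability by summing over tags. The function names, the Penn Treebank-style tags, and the toy sentences are all hypothetical.

```python
from collections import Counter

START = ("<s>", "<s>")   # hypothetical sentence-start token
END = ("</s>", "</s>")   # hypothetical sentence-end token


def train(tagged_sentences):
    """Collect trigram and history counts over (word, tag) tokens."""
    trigrams, histories, vocab = Counter(), Counter(), set()
    for sentence in tagged_sentences:
        tokens = [START, START] + list(sentence) + [END]
        vocab.update(tokens[2:])
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            trigrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return trigrams, histories, vocab


def joint_prob(model, history, token, k=1.0):
    """P(word, tag | history) with add-k smoothing (an assumption of this sketch)."""
    trigrams, histories, vocab = model
    numerator = trigrams[history + (token,)] + k
    denominator = histories[history] + k * len(vocab)
    return numerator / denominator


def word_prob(model, history, word, k=1.0):
    """P(word | history): marginalize the joint probability over all tags."""
    _, _, vocab = model
    return sum(joint_prob(model, history, tok, k) for tok in vocab if tok[0] == word)


# Toy data: the word "can" appears with two different tags.
toy_corpus = [
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
    [("they", "PRP"), ("can", "MD"), ("swim", "VB")],
]
toy_model = train(toy_corpus)
after_the = (START, ("the", "DT"))
p_noun = joint_prob(toy_model, after_the, ("can", "NN"))
p_modal = joint_prob(toy_model, after_the, ("can", "MD"))
p_word = word_prob(toy_model, after_the, "can")
```

In this toy setting the tagged history lets the model prefer the noun reading of "can" after "the", which is the kind of distinction a word-only trigram model cannot represent.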
Similar resources
Beyond N-Grams: Can Linguistic Sophistication Improve Language Modeling?
It seems obvious that a successful model of natural language would incorporate a great deal of both linguistic and world knowledge. Interestingly, state of the art language models for speech recognition are based on a very crude linguistic model, namely conditioning the probability of a word on a small fixed number of preceding words. Despite many attempts to incorporate more sophisticated info...
Hierarchical Bayesian Language Modelling for the Linguistically Informed
In this work I address the challenge of augmenting n-gram language models according to prior linguistic intuitions. I argue that the family of hierarchical Pitman-Yor language models is an attractive vehicle through which to address the problem, and demonstrate the approach by proposing a model for German compounds. In an empirical evaluation, the model outperforms the Kneser-Ney model in terms...
Introducing linguistic constraints into statistical language modeling
Building robust stochastic language models is a major issue in speech recognition systems. Conventional word-based n-gram models do not capture any linguistic constraints inherent in speech. In this paper the notion of function and content words (open/closed word classes) is used to provide linguistic knowledge that can be incorporated into language models. Function words are articles, preposit...
Language identification incorporating lexical information
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to prov...
Multi Class-based n-gram Language Model for New Words Using Web Data
Out-of-vocabulary (OOV) words cause a serious problem for automatic speech recognition (ASR) systems. Not only is an OOV word misrecognized as an in-vocabulary word with similar phonetics, but the error also causes recognition errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as one entity, ignoring the linguistic information. In this paper we pres...